Evolution of the Protein Interaction Network of Budding Yeast: 
Role of the Protein Family Compatibility Constraint 






K.-I. Goh, B. Kahng, and D. Kim 
School of Physics and Center for Theoretical Physics, Seoul National University NS50, Seoul 151-747, Korea 

(Dated: February 9, 2008) 

Understanding how protein interaction networks (PIN) of living organisms have evolved or are 
organized can be the first stepping stone in unveiling how life works on a fundamental basis. Here, 
we introduce a new in-silico evolution model of the PIN of budding yeast, Saccharomyces cerevisiae; 
the model is composed of the PIN and the protein family network. The basic ingredient of the 
model includes family compatibility which constrains the potential binding ability of a protein, as 
well as the previously proposed gene duplication, divergence, and mutation. We investigate various 
structural properties of our model network with parameter values relevant to budding yeast and 
find that the model successfully reproduces the empirical data. 

Studying complex systems by means of their network representation has attracted much
attention recently. The cell, one of the best examples of complex systems, can also be
viewed as a network: The cellular components, such as genes, proteins, and other
biological molecules, connected by all physiologically relevant interactions, form a
full weblike molecular architecture in a cell. Among the various levels, the protein
interaction network (PIN) plays a pivotal role as it acts as a basic physical protocol
of cooperative functioning in many physiological processes. In the PIN, proteins are
viewed as nodes, and two proteins are linked if they physically contact each other.
Thanks to recent progress in high-throughput experimental techniques, the data set of
protein interactions for budding yeast, Saccharomyces cerevisiae, has been firmly
established in the last few years. Thus, the yeast PIN offers a good testbed for
understanding how such a network has evolved to its present form from basic
evolutionary rules. In this paper, our aim is to introduce a simple evolutionary model
to reproduce the structural properties of the PIN of budding yeast, thereby deepening
our understanding of the driving force for cellular evolution.

At a certain level of abstraction, one may view a pro- 
tein as an assembly of domains. It is domains that offer 
structural and functional units. They act as basic units 
in the interactions between proteins and in the evolution 
of protein structures. Proteins are grouped into so-called 
protein families or superfamilies according to the domain 
structure within them [17]. The proteins within a family
are monophyletic; that is, they originate from a common
ancestor and are fairly well conserved during evolution.
The protein family network (PFN) is defined as the one
whose nodes are protein families, and two families are
connected if any of the domains within them occur together
in a single protein or any proteins within
them interact with each other [18]. The distributions
of the degrees and the sizes of the families in the PFN
also follow power laws. Given that the entities of
proteins and protein families are not separable but linked
via domains as intermediates, it is desirable to unify their
evolutions into a single framework.

So far, several in-silico evolution models have been proposed for the yeast PIN. A
distinguishing aspect in the evolution of the PIN compared with that of other complex
networks is the concept of "evolution by duplication": A new protein is thought to be
created mainly by gene duplication. Subsequently, the duplicate protein may lose some
of the interactions inherited from its ancestor to reduce redundancy, which is called
divergence (or diversification). A protein also gains new interactions with other
proteins via mutation. These three processes, duplication, divergence, and mutation,
have been regarded as the basic ingredients in the evolution of the PIN. While those
in-silico models were successful in generating a fat-tail or power-law behavior in the
degree distribution, they hardly reproduced other structural properties of the yeast
PIN, such as the clustering coefficient and the assortativity, which we will specify
in more detail shortly. The model we introduce here, however, reproduces these other
structural properties of the yeast PIN as well as the degree distribution. To
this end, we introduce the concept of "family compatibil- 
ity" (FC): An interaction between two proteins is possible 
only when the corresponding families they belong to are 
compatible, and only those families linked via the PFN 
are compatible with one another. With this, we realize 
the effective structural constraint in physical binding be- 
tween proteins, which is coupled with the evolutionary 
lineage of proteins through the notion of protein family. 

FIG. 1: Schematic picture of the evolution rule of the model.
The elementary steps are composed of i) duplication (light
blue protein -> red protein), ii) divergence (dashed pink
links), and iii) mutation (violet link from the pink protein).
In addition, the mutation is constrained by family compatibility;
for example, the pink protein cannot interact with the
black protein because they are not compatible.

Model. The model can be depicted schematically as in Fig. 1. The whole system is
composed of two types of networks, the PIN and the PFN. A number of proteins are
grouped, forming a protein family. Protein families link to other protein families,
forming the PFN. Two proteins belonging to different protein families can interact
only when the respective families are also linked. Each family has a fitness-like
parameter, the number of domains within it, D_F, which is not fixed, but evolves with
the PFN. The evolution takes place in two stages. In the first stage, the protein
families are created along with the proteins; thus, the PFN coevolves with the PIN. In
the second stage, the PFN is kept fixed, and the evolution of the PIN continues on top
of it. A detailed description of the procedure is as follows:

1. Initially, there are n_0 proteins, each of which constitutes
its own protein family. All n_0 proteins are
interconnected with one another, as are the n_0 protein
families. We choose n_0 = 3 to be minimal.
Each family has D_F = 2 domains, the number of
family-links it has.

2. In the first stage, proteins and protein families coevolve: At each step, with
rate α, a new protein, say a, is created by duplicating an existing protein b chosen
randomly. The new protein a creates its own protein family F_a. Each of the inherited
interactions of the protein a is removed with probability δ, a process called
divergence. Through divergence, the degree of the new protein a, k_a, usually becomes
less than that of the mother protein, k_b. The linkage of the new protein family is
determined by that of the protein created. By this process, the newly born family F_a
consists of a single protein, but has a number of linkages, say K_{F_a}, to existing
families. The initial number of domains in the family is set to D_{F_a} = K_{F_a}. In
some cases, the newly created protein is left with no interaction at all
(K_{F_a} = 0). In this case, we do not let it establish a new family, but regard it as
a remnant in the previous family; that is, the population of the family to which the
duplicated protein belongs is increased by 1. Note that the remnant can later gain new
interactions via mutation, described below, and join the protein interaction network.

With rate 1, a randomly chosen existing protein i gains a new interaction to another
previously unlinked protein j, which is chosen among the proteins within compatible
families, according to the probability,

    \Pi_{i \to j} = \frac{D_{F_j}}{\sum_{F_l \leftrightarrow F_i} D_{F_l}},    (1)

where F_i means the family to which the protein i belongs and X <-> Y means that the
families X and Y are compatible, i.e., linked in the PFN. Eq. (1), the preferential
attachment in the domain abundance constrained by FC, makes our model distinct and
successful. In this process, which we call mutation, the number of domains in the
family F_i increases by 1, but the number of domains in F_j does not. This accounts
for the acquisition of a new domain via mutation in the family F_i. This stage lasts
until 1,000 proteins have been made, during which about 500 to 600 families are
created, a number comparable with the number of superfamilies in yeast [26].

3. In the second stage, the same protein evolution pro- 
cess as in the first stage occurs, except that the 
PFN is kept fixed and the daughter protein remains 
in the same family as its mother in the duplication 
process. This stage lasts until there are about 6,000 
proteins in the network, the approximate size of the 
yeast proteome. 
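
The two-stage procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code. The function name `simulate` and all variable names are hypothetical. We assume that each step performs a duplication with probability α followed by one mutation attempt, that K_{F_a} is the number of distinct families among the retained partners, that the candidate j is drawn with weight D_{F_j} among unlinked proteins in compatible families, and that domain counts keep growing in the second stage.

```python
import random

def simulate(n_stage1=1000, n_total=6000, alpha=0.8, delta=0.7, n0=3, seed=0):
    """Sketch of the two-stage PIN/PFN model (assumptions noted in comments)."""
    rng = random.Random(seed)
    # Step 1: n0 fully connected proteins, one family each, D_F = n0 - 1.
    adj = {p: {q for q in range(n0) if q != p} for p in range(n0)}
    fam = list(range(n0))                       # fam[p] = family of protein p
    fam_adj = {f: {g for g in range(n0) if g != f} for f in range(n0)}
    D = {f: n0 - 1 for f in range(n0)}          # domain count per family

    def duplicate(coevolve):
        b = rng.randrange(len(fam))             # mother protein
        a = len(fam)                            # index of the new protein
        # Divergence: each inherited link is removed with probability delta.
        kept = {c for c in adj[b] if rng.random() >= delta}
        adj[a] = kept
        for c in kept:
            adj[c].add(a)
        if coevolve and kept:
            f = len(D)                          # new family F_a
            links = {fam[c] for c in kept}      # assumed meaning of K_{F_a}
            fam_adj[f] = links
            for g in links:
                fam_adj[g].add(f)
            D[f] = len(links)
            fam.append(f)
        else:
            fam.append(fam[b])                  # remnant, or stage 2: mother's family

    def mutate():
        i = rng.randrange(len(fam))
        cand = [j for j in range(len(fam))
                if j != i and j not in adj[i] and fam[j] in fam_adj[fam[i]]]
        if not cand:
            return
        # Preferential attachment in domain abundance, restricted by FC.
        j = rng.choices(cand, weights=[D[fam[c]] for c in cand])[0]
        adj[i].add(j)
        adj[j].add(i)
        D[fam[i]] += 1                          # only F_i gains a domain

    while len(fam) < n_total:
        if rng.random() < alpha:                # assumed reading of "rate alpha"
            duplicate(coevolve=len(fam) < n_stage1)
        mutate()
    return adj, fam, fam_adj, D
```

The candidate search in `mutate` is linear in the number of proteins, so a full run with the default sizes is slow in pure Python; small values of `n_stage1` and `n_total` are enough to check the bookkeeping.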

A few remarks on the model are in order. First, this 
model is designed to be as simple as possible while imple- 
menting FC into the trio of duplication, divergence, and 
mutation, which we believe to be the most basic pro- 
cesses. Many interesting processes, such as lateral gene 
transfer and de novo creation of proteins and protein fam- 
ilies, are not covered in this model, however. Second, 
we made an assumption that the time-scale of the PFN 
evolution is strictly separated, which might be an over- 
simplification. Third, proteins and protein families may 
become extinct during evolution, followed by the loss of 
the interactions between them. However, we may view 
the parameters of the evolution rates, such as α and δ,
as effective ones incorporating all these details. Also, for
the sake of minimizing the number of free parameters, we
assume that the duplication and the divergence rates of
proteins and protein families are equal, i.e., α = α_F and
δ = δ_F, although we can fix α and δ for any given set of
(α_F, δ_F) to incorporate the empirical data.

FIG. 2: Simulation results (O) of the model agree well with the empirical data (o). Shown are (a) the degree distribution P(k),
(b) the hierarchical clustering C(k), and (c) the average neighbor-degree function ⟨k_nn⟩(k) for the protein interaction network.
The dotted line in (a) is a fit to Eq. (2). The results of the model without FC (□), which fail to reproduce the empirical
features, are also shown for comparison.

Structure of the yeast PIN. Several analyses on the topological properties of the
yeast PIN have been performed during recent years [27, 28, 29]. Since then, however,
new protein-protein interactions in yeast have been discovered steadily, so we repeat
the analysis by integrating the most up-to-date data from various public resources,
such as (i) the database at the Munich Information Center for Protein Sequences,
(ii) the Database of Interacting Proteins, (iii) the Biomolecular Interaction Network
Database, (iv) the two-hybrid datasets obtained by Uetz et al., by Ito et al., and by
Tong et al., and (v) the mass spectrometry data (filtered) by Ho et al. [13]. After
trimming the synonyms and other redundant entries manually, the resulting network
consists of 15,652 interactions (excluding self-interactions) between 4,926 nodes (in
terms of distinct open reading frames and other biomolecules).
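
As a quick consistency check on these counts, the average degree of an undirected network with E links among N nodes is 2E/N. Inserting the numbers quoted above gives:

```latex
\langle k \rangle = \frac{2E}{N} = \frac{2 \times 15652}{4926} \approx 6.35
```

This agrees with the average degree listed for the yeast PIN in Table I.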

The topological properties of the integrated yeast PIN 
are shown in Fig. 2: 

(a) The degree distribution of the PIN fits well to the
generalized Pareto distribution (or a generalized power law),

    P_d(k) \sim (k + k_0)^{-\gamma},    (2)

with k_0 = 8.0 and γ ≈ 3.45. Note that functional forms of the degree
distribution different from Eq. (2) were proposed previously, based on smaller-scale
datasets than the current one.
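
To see why Eq. (2) is called a generalized power law, one can compute its local slope on a log-log plot. The following is a short derivation from the functional form alone, with k_0 the offset and γ the exponent of Eq. (2); it is not a result quoted from the fit.

```latex
\frac{d \ln P_d(k)}{d \ln k} = -\gamma \, \frac{k}{k + k_0}
\;\longrightarrow\;
\begin{cases}
0, & k \ll k_0, \\
-\gamma, & k \gg k_0 .
\end{cases}
```

With k_0 = 8.0, the distribution is therefore nearly flat for the smallest degrees and approaches a pure power law with exponent γ only for degrees well above 8.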

(b) The yeast PIN is highly clustered and modular. To quantify this, we measured the
local clustering of a protein i, C_i = 2e_i/[k_i(k_i - 1)], where e_i is the number of
links present between the k_i neighbors of node i, out of the maximum possible number
k_i(k_i - 1)/2. The clustering coefficient of a graph, C, is the average of C_i over
all nodes with k_i ≥ 2. We obtain C ≈ 0.128. C(k) is the clustering function, the
average of C_i over vertices with degree k. C(k) exhibits a plateau for small k while
it drops rapidly for large k. Such a plateau in the clustering function may reflect
the functional module structure within the PIN, inside which the network is denser due
to the high cooperativity to perform a given cellular task. Such locally dense modules
are interconnected by a few global mediators, which are likely to be the hubs in the
PIN [34]. This feature is what most existing PIN models fail to reproduce. As we will
show, the FC constraint that we introduce successfully accounts for the emergence of
the plateau in C(k).
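
The three clustering quantities defined in (b) can be computed from an adjacency map as in the sketch below. The function name `clustering` and the dict-of-sets input format are assumptions made for this illustration. Nodes with fewer than two neighbors are skipped because C_i is undefined for them.

```python
from collections import defaultdict

def clustering(adj):
    """adj: dict node -> set of neighbours (undirected, no self-loops).

    Returns (C_i per node, graph average C, degree-resolved C(k)).
    """
    c_i = {}
    for i, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # C_i needs at least two neighbours
        # Count links among the neighbours of i; each is seen twice.
        e = sum(1 for u in nbrs for v in adj[u] if v in nbrs) // 2
        c_i[i] = 2 * e / (k * (k - 1))
    # Average over all nodes for which C_i is defined.
    C = sum(c_i.values()) / len(c_i)
    # Group C_i by degree to obtain the clustering function C(k).
    by_k = defaultdict(list)
    for i, c in c_i.items():
        by_k[len(adj[i])].append(c)
    C_k = {k: sum(v) / len(v) for k, v in by_k.items()}
    return c_i, C, C_k
```

For a triangle with one pendant node attached, `{0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}`, node 0 has one link among its three neighbors, so C_0 = 1/3, while nodes 1 and 2 have C_i = 1.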

(c) The yeast PIN shows a disassortative degree correlation. The average
neighbor-degree function ⟨k_nn⟩(k) is measured to be ⟨k_nn⟩(k) ~ k^{-ν} with ν ≈ 0.3,
somewhat smaller than the value reported based on a single two-hybrid dataset alone.
The assortativity r, defined as the Pearson correlation coefficient between the
degrees of the two vertices on each side of a link [36], is measured to be r ≈ -0.13.
In Table I we summarize our measurements for the topological properties of the
integrated yeast PIN.
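
Both degree-correlation measures in (c) follow directly from their definitions, as the sketch below shows. The function name `degree_correlations` and the dict-of-sets input are assumptions for this illustration. Each link is counted in both directions, so the two ends share the same mean and variance and the Pearson coefficient reduces to a covariance divided by a variance. The ratio is undefined for a graph in which all nodes have the same degree.

```python
from collections import defaultdict

def degree_correlations(adj):
    """adj: dict node -> set of neighbours (undirected, no self-loops).

    Returns (<k_nn>(k) as a dict keyed by degree, assortativity r).
    """
    deg = {i: len(nbrs) for i, nbrs in adj.items()}
    # Average neighbour degree of each node, grouped by the node's degree.
    per_k = defaultdict(list)
    for i, nbrs in adj.items():
        if nbrs:
            per_k[deg[i]].append(sum(deg[j] for j in nbrs) / deg[i])
    knn = {k: sum(v) / len(v) for k, v in per_k.items()}
    # Degrees at the two ends of every link, counted in both directions.
    xs = [deg[i] for i in adj for j in adj[i]]
    ys = [deg[j] for i in adj for j in adj[i]]
    n = len(xs)
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return knn, cov / var
```

A star graph is the extreme disassortative case: every link joins the hub to a leaf, so r = -1 and ⟨k_nn⟩(k) decreases with k.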

Results. Now we compare the simulation results of our model with the empirical data.
In typical simulations, we employed α = 0.8 and δ = 0.7. The value of δ was chosen to
accommodate the fact that superfamilies exhibit extensive sequence diversity [37]. The
value of α was set to match the empirical value of the average degree of the PIN,
⟨k⟩ ≈ 6.4. Also, we matched approximately the numbers of protein families and proteins
with those of budding yeast, as we described before. The results obtained from the
model show good agreement with the empirical data, as shown in Fig. 2 and Table I. In
Fig. 2, we also show the results of the model without FC, which is similar to the
model of Solé et al. One can clearly see that without FC, we cannot account for the
clustering and the degree correlation characteristics. We also examine the full
degree-correlation profile of the joint probability P(k, k') that two proteins with
degrees k and k' are connected to each other. The degree-correlation intensity is
quantified by P(k, k')/P_random(k, k'), the ratio to the joint probability in the
randomized ensemble of the original network [29, 38]. As shown in Fig. 3, the profile
obtained from the model has a pattern that is quite similar to that of the empirical
yeast PIN.

TABLE I: Topological quantities of the integrated yeast PIN
and the model network. Error bars in the model results are
the standard deviations of the quantities from 1000 runs.

item                                 model         yeast PIN
total number of nodes n              6000          ≈6000
number of interacting nodes N        5079±54       4926
average degree ⟨k⟩                   6.5±0.3       6.35
clustering coefficient C             0.13±0.02     0.128
assortativity index r                -0.09±0.04    -0.13
size of the largest component N_1    5051±53       4832

FIG. 3: Comparison between the degree correlation profiles of (a) the yeast PIN and (b) the model network. The color code
denotes the value of log10[P(k, k')/P_random(k, k')]. The randomized networks are generated by the switching method [29] that
conserves the degree sequence.

FIG. 4: Network randomization test with and without FC. (a)
Clustering function C(k) and (b) the clustering coefficient C
as functions of the number of edge shufflings are shown. Symbols
are for the unperturbed model network (O), the network
shuffled with FC (o), and the network shuffled without FC
(□). The horizontal line in (b) corresponds to the value of
the clustering coefficient in the unperturbed model network.

FIG. 5: Simulation results for the protein family network: (a)
the family degree distribution P_d(k_F) and (b) the family size
distribution P_s(s_F). The dotted lines in (a) and (b) are fit
lines to Eq. (2).

To get further support for the relevance of the FC constraint, we performed a network
randomization test. We randomized the model network by using the conventional edge
switching method [29], but with the FC constraint. That is, when we are to switch the
interactions between the protein pairs, only the switching attempts that preserve FC
are accepted. In this way, we can isolate the role of FC. In Fig. 4, we show the
results of randomization. We find that the high clustering of the network is preserved
under randomization with FC, but not without FC. Without FC, the clustering
coefficient drops as soon as we shuffle the network, as can be seen in Fig. 4(b).
Thus, we conclude that FC indeed plays a crucial role in PIN evolution.
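
The randomization test can be sketched as a degree-preserving edge switch with an optional compatibility check. This is an illustrative implementation under assumptions, not the authors' code: the function name `switch_edges`, the argument layout, and the rule of rejecting moves that would create self-links or duplicate links are choices made here.

```python
import random

def switch_edges(edges, fam, fam_adj, n_attempts, use_fc=True, seed=0):
    """Degree-preserving edge switching, optionally constrained by FC.

    edges:   list of (u, v) pairs of an undirected simple graph
    fam:     mapping protein -> family
    fam_adj: mapping family -> set of compatible families
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}

    def compatible(u, v):
        return fam[v] in fam_adj[fam[u]]

    for _ in range(n_attempts):
        x, y = rng.sample(range(len(edges)), 2)
        a, b = edges[x]
        c, d = edges[y]
        if rng.random() < 0.5:      # pick one of the two possible rewirings
            c, d = d, c
        # Proposed move: (a, b), (c, d) -> (a, d), (c, b).
        if a == d or c == b:
            continue                # would create a self-link
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue                # would create a duplicate link
        if use_fc and not (compatible(a, d) and compatible(c, b)):
            continue                # would violate family compatibility
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[x], edges[y] = (a, d), (c, b)
    return edges
```

Calling the function once with `use_fc=True` and once with `use_fc=False` on the same network, and measuring the clustering coefficient after increasing numbers of attempts, corresponds to the comparison described for Fig. 4.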

Finally, we check the properties of the PFN. In Fig. 5,
we show the degree distribution of the PFN and the family
size distribution generated in silico. The degree
distribution of the PFN follows a similar form to Eq. (2),
but with a different value of the exponent, γ_F ≈ 3. The
family size distribution also follows a power law with an
exponent of 3 to 4.

In summary, we have introduced an in-silico model for 
PIN evolution. The model network is composed of the 
PIN and the PFN. In the early stage of evolution, the 
PIN and the PFN coevolve, and in the later stage, the 
PFN becomes fixed. The evolution proceeds by the three 
major mechanisms previously proposed, duplication, 
divergence, and mutation. However, it is constrained 
by FC and follows a modified preferential attachment
rule in the domain abundance, which is the new feature
of our model. We have checked various structural
properties of the model network, finding that they show
good agreement with those of the integrated empirical
data of the yeast PIN. Finally, it would be interesting to
apply our model to higher eukaryotes, as the data for the
protein interactions are accumulating for multicellular
species such as the nematode worm Caenorhabditis
elegans and the fruit fly Drosophila melanogaster.